91 research outputs found

Split and Rephrase

Author: Cohen Shay
Gardent Claire
Narayan Shashi
Shimorina Anastasia
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2017

We propose a new sentence simplification task (Split-and-Rephrase) where the aim is to split a complex sentence into a meaning-preserving sequence of shorter sentences. Like sentence simplification, splitting-and-rephrasing has the potential to benefit both natural language processing and societal applications. Because shorter sentences are generally better processed by NLP systems, it could be used as a preprocessing step which facilitates and improves the performance of parsers, semantic role labellers and machine translation systems. It should also be of use for people with reading disabilities because it allows the conversion of longer sentences into shorter ones. This paper makes two contributions towards this new task. First, we create and make available a benchmark consisting of 1,066,115 tuples mapping a single complex sentence to a sequence of sentences expressing the same meaning. Second, we propose five models (vanilla sequence-to-sequence to semantically-motivated models) to understand the difficulty of the proposed task. Comment: 11 pages, EMNLP 201

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

Edinburgh Research Explorer

Local String Transduction as Sequence Labeling

Author: Carreras Xavier
Cohen Shay
Narayan Shashi
Ribeiro Joana
Publication date: 01/08/2018

We show that the general problem of string transduction can be reduced to the problem of sequence labeling. While character deletions and insertions are allowed in string transduction, they do not exist in sequence labeling. We show how to overcome this difference. Our approach can be used with any sequence labeling algorithm and it works best for problems in which string transduction imposes a strong notion of locality (no long-range dependencies). We experiment with spelling correction for social media, OCR correction, and morphological inflection, and we see that it behaves better than seq2seq models and yields state-of-the-art results in several cases. Peer reviewed

Edinburgh Research Explorer

Digital.CSIC

Creating Training Corpora for NLG Micro-Planning

Author: Gardent Claire
Narayan Shashi
Perez-Beltrachini Laura
Shimorina Anastasia
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2017

In this paper, we focus on how to create data-to-text corpora which can support the learning of wide-coverage micro-planners, i.e., generation systems that handle lexicalisation, aggregation, surface realisation, sentence segmentation and referring expression generation. We start by reviewing common practice in designing training benchmarks for Natural Language Generation. We then present a novel framework for semi-automatically creating linguistically challenging NLG corpora from existing Knowledge Bases. We apply our framework to DBpedia data and compare the resulting dataset with the dataset of Wen et al. (2016). We show that while the dataset of Wen et al. (2016) is more than twice as large as ours, it is less diverse both in terms of input and in terms of text. We thus propose our corpus generation framework as a novel method for creating challenging data sets from which NLG models capable of generating text from KB data can be learned.

Crossref

INRIA a CCSD electronic archive server

Edinburgh Research Explorer

Error Mining with Suspicion Trees: Seeing the Forest for the Trees

Author: Gardent Claire
Narayan Shashi
Publication venue: HAL CCSD
Publication date: 08/12/2012

In recent years, error mining approaches have been proposed to identify the most likely sources of errors in symbolic parsers and generators. However, the techniques used generate a flat list of suspicious forms ranked by decreasing order of suspicion. We introduce a novel algorithm that structures the output of error mining into a tree (called a suspicion tree) highlighting the relationships between suspicious forms. We illustrate the impact of our approach by applying it to detect and analyse the most likely sources of failure in surface realisation, and we show how the suspicion tree built by our algorithm helps present the errors identified by error mining in a linguistically meaningful way, thus providing better support for error analysis. The right frontier of the tree highlights the relative importance of the main error cases, while the subtrees of a node indicate how a given error case divides into smaller, more specific cases.

CiteSeerX

INRIA a CCSD electronic archive server

A Generation Framework for Grammar Development

Author: Gardent Claire
Narayan Shashi
Publication date: 06/03/2015

Edinburgh Research Explorer

Error Mining on Dependency Trees

Author: Gardent Claire
Narayan Shashi
Publication venue: HAL CCSD
Publication date: 01/01/2012

In recent years, error mining approaches were developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. However, the techniques they use to enumerate and count n-grams build on the sequential nature of a text corpus and do not easily extend to structured data. In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator, as well as a few idiosyncrasies/errors in the input data.

CiteSeerX

INRIA a CCSD electronic archive server